July 26, 2016

OUTLINE

OUTLINE - Trees and Random Forests

  • Tree-structured Models (a.k.a. Decision Tree)
    • CART
    • GUIDE
    • PARTY
    • Extensions of Decision Tree: Subgroup Identification
    • Case Studies using R
  • Random Forests
    • Breiman's Random Forests
    • Random forests with conditional inference
    • GUIDE Forests (options)
    • Extensions
    • Case Studies using R

OUTLINE - Trees with Neural Net

  • Trees as nonparametric models
    • Trees as K-NN
    • Trees as an additive model
    • Trees as neural networks
  • Neural random forests

  • Deep neural decision forests

Decision Tree Guided Tour

What is a Decision Tree?

A decision tree is a logical model represented as a tree that shows how the value of a target variable can be predicted by using the values of a set of predictor (input) variables.

A decision tree recursively partitions the data and sample space to construct the predictive model.

Its name derives from the practice of displaying the partitions as a decision tree, from which the roles of the predictor variables may be inferred.

A classification or regression tree is a prediction model that can be represented as a decision tree.

A tree-structured classifier is a decision tree for predicting a class variable from one or more predictor variables.

A regression tree model is a nonparametric estimate of a regression function constructed by recursively partitioning a data set with the values of its predictor variables.

Recursive partitioning methods have become popular and widely used tools for nonparametric regression and classification in many scientific fields.
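
As a minimal illustration of these definitions (a Python/scikit-learn sketch; the case studies later in these notes use R, and the dataset choice here is arbitrary):

```python
# Fit a small classification tree and print its recursive partition.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each printed rule is one split; leaves carry the predicted class.
print(export_text(clf, feature_names=load_iris().feature_names))
```

The printed rules make the roles of the predictor variables directly readable, as described above.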

Motivation Example of Decision Tree

Morgan and Sonquist (1963, JASA)


Elements of Decision Tree

An empirical tree represents a segmentation of the data that is created by applying a series of split rules. Each split rule assigns an observation to a segment based on the values of inputs.

One rule is applied after another, resulting in a hierarchy of segments within segments. The hierarchy is called a tree, and each segment is called a node.

The original segment contains the entire data set and is called the root node of the tree. A node with all its successors forms a branch of the node that created it.

The nodes that have child nodes are called interior (or internal, intermediate) nodes. The final nodes that do not have child nodes are called leaves (or terminal nodes). For each leaf, a decision is made and applied to all observations in the leaf.

The type of decision depends on the context. In predictive modeling, the decision is simply the predicted value.

Building and utilizing a Decision Tree

  • Splitting recursively
    • Partition the data and sample space recursively to construct a predictive model.
    • Split selection: split variable selection and split set selection
  • Determining a tree size
    • Partitioning continues until stopping criteria are met. (Stopping criteria: sample size, impurity, improvement, and so on)
    • Partitioning continues until node sample sizes are too small, and then prune insignificant nodes that cause over-fitting.
  • Interpretation and Prediction
    • Interpret the final predictive tree model with the significant split variables.
    • Predict target values of new data using the final predictive tree model.
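
The stopping criteria above correspond directly to tuning parameters in most implementations. A hedged sketch using scikit-learn's parameter names (R's rpart exposes analogous minsplit and cp arguments):

```python
# Map the stopping criteria to tree hyperparameters:
# small node sample size and minor impurity improvement.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(
    min_samples_split=20,        # stop if node sample size is small
    min_impurity_decrease=0.01,  # stop if improvement is minor
    random_state=0,
).fit(X, y)
print("leaves:", tree.get_n_leaves())
```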

Split Rule Selection

  • Split rule: consisting of a split variable and a split point (or set).
    • \(X \leq c\) for ordered variable
    • \(X \in A\) for unordered variable
  • Split rule: determined by examining all possible splits (exhaustive search) or performing statistical tests

  • Good split rule: the data being homogeneous within each subnode and heterogeneous between subnodes.
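
The exhaustive-search idea can be written out directly. The sketch below (a hand-rolled illustration, not any package's implementation) scores every candidate split \(X \leq c\) of a single ordered variable by the weighted Gini impurity of the two subnodes:

```python
# Exhaustive search for the best split X <= c on one ordered variable.
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    best_c, best_imp = None, np.inf
    for c in np.unique(x)[:-1]:          # candidate cut points
        left, right = y[x <= c], y[x > c]
        imp = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if imp < best_imp:               # keep the most homogeneous split
            best_c, best_imp = c, imp
    return best_c, best_imp

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
c, imp = best_split(x, y)
print(c, imp)  # X <= 3.0 separates the classes perfectly
```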

Determination of Tree Size

  • Naïve stopping rule: Stop splitting if
    • a node sample size is small,
    • a node impurity is low,
    • or an improvement is minor.
  • Pruning rule:
    • Step 1: Stop splitting if a node sample size is small.
    • Step 2: Prune off unnecessary branches.
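
Step 2 of the pruning rule can be illustrated with CART-style cost-complexity pruning, which scikit-learn exposes via ccp_alpha (an illustrative stand-in for the pruning described here):

```python
# Grow a large tree, then prune back using the cost-complexity criterion.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
full = DecisionTreeClassifier(random_state=0).fit(X, y)
path = full.cost_complexity_pruning_path(X, y)

# Larger alpha = heavier penalty per leaf = smaller subtree.
pruned = DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[-2],
                                random_state=0).fit(X, y)
print(full.get_n_leaves(), pruned.get_n_leaves())
```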

Some Review Papers

  • Loh, W.-Y. (2008). Regression by parts: Fitting visually interpretable models with GUIDE. In Handbook of Data Visualization, C.Chen, W.Härdle, and A.Unwin, Eds. Springer, pp.447-469.

  • Loh, W.-Y. (2008). Classification and regression tree methods. In Encyclopedia of Statistics in Quality and Reliability, F.Ruggeri, R.Kenett, and F.W. Faltin, Eds. Wiley, Chichester, UK, pp.315-323.

  • Loh, W.-Y. (2010). Tree-structured classifiers. Wiley Interdisciplinary Reviews: Computational Statistics, 2, 364-369.

  • Loh, W.-Y. (2011). Classification and regression trees. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1, 14-23.

  • Loh, W.-Y. (2014). Fifty years of classification and regression trees (with discussion). International Statistical Review, 82, 329-348.

  • Merkle, E.C. and Shaffer, V.A. (2011). Binary recursive partitioning: Background, methods, and application to psychology, British Journal of Mathematical and Statistical Psychology, 64, 161–181.

  • Morgan, J. N. and Sonquist, J. A. (1963). Problems in the analysis of survey data, and a proposal. J. Amer. Statist. Assoc. 58 415–434.

  • Strobl, C., Malley, J. and Tutz, G. (2009). An Introduction to Recursive Partitioning: Rationale, Application, and Characteristics of Classification and Regression Trees, Bagging, and Random forests. Psychological Methods, 14(4), 323–348.

Major Tree Algorithms

Major Regression Tree Algorithms

  • Piecewise constant least squares models: AID (Morgan and Sonquist, 1963), CART, RPART, GUIDE (Loh, 2002)

  • Piecewise linear least squares, polynomial regression, quantile regression, and longitudinal and multi-responses data effects: Loh (2014) with discussion

  • Subgroup identification of differential treatment effects: Loh et al. (2014), Hothorn et al. (2014)

  • Others: M5 (Quinlan, 1992), MOB (Zeileis et al., 2008), MELT (Eo and Cho, 2014), KAPS (Eo et al., 2015)

  • Missing values, selection bias, accuracy, speed, and tree complexity

  • Diagnostic tool: Simonoff (2013)

Notations

  • Basic notations

  • More notations

CART: Node Impurity Measures (Node Model)

CART: Split Set Selection

CART: Cost-Complexity Pruning

CART: Subtree Selection by \(V\)-fold cross-validation

CART: \(V\)-fold cross-validation

CART: \(k\)-SE Rule

CART: Surrogate Splits

CART: Surrogate Splits Algorithm

CART: Uses of Surrogate Splits

  • Enable tree construction when there are missing values in the learning sample

  • Enable classification of new cases with missing values

  • Rank variables by their order of importance (not available in rpart)

  • Detect masking of variables

CART: Importance Ranking of predictor variables

CART: Problems

  • Biased toward variables with more splits: a \(k\)-valued ordered variable has \(k - 1\) splits; a \(k\)-valued categorical variable has \(2^{k-1} - 1\) splits.

  • Biased toward predictors with more missing values: the split method uses only the proportions of nonmissing cases; it ignores the number of missing values. A variable taking a unique value for exactly one case in each class and missing on all other cases yields the largest decrease in impurity. Bias exists for surrogate splits too.

  • Computation: Impractical when there are three or more classes and categorical variables with many values. Note: Because CART and RPART encode each categorical variable split with a 32-bit binary integer, they do not properly deal with categorical variables having more than 32 values.

  • Prediction accuracy: Often no better than linear discriminant analysis.
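
The split-count bias in the first bullet is easy to quantify:

```python
# Candidate binary splits per variable: linear in k for ordered
# variables, exponential in k for categorical variables.
def n_splits_ordered(k):
    return k - 1                 # cut points for X <= c

def n_splits_categorical(k):
    return 2 ** (k - 1) - 1      # nonempty, non-redundant subsets A

for k in (2, 5, 10, 32):
    print(k, n_splits_ordered(k), n_splits_categorical(k))
```

At \(k = 32\) the categorical count already exceeds two billion, which is why CART/RPART's 32-bit encoding is a hard limit.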

GUIDE: Residual-based Tree-structured Modeling (Residual Analysis)

GUIDE: Split Variable Selection

GUIDE: Split Set Selection

GUIDE: Importance Ranking

CTREE/MOB: Tree-structured Models with Conditional Inference

CTREE/MOB: Generic Algorithm

CTREE/MOB: Statistical Testing

CTREE/MOB: Selection of Split Variable

CTREE/MOB: Parameter Instability Test

CTREE/MOB: Pruning

Subgroup Identification via Recursive Partitioning

Motivation

At each node, a case goes to the left child node if the stated condition is satisfied. Sample sizes are shown beside terminal nodes; 95% bootstrap intervals for the relative risk of treatment vs. placebo appear below the nodes.

Subgroup Analyses

Identify groups of patients for whom the treatment has a different effect than for others. The effect may be:

– Stronger
– Weaker
– Contrary (opposite in direction)

relative to the average treatment effect.

Suitable models promise better prediction of treatment effect and thus individualised treatments.

Regression trees are natural for subgroup identification

  • Subgroups are defined by the terminal nodes of a tree.

Key Ideas

  • Use Piecewise-Linear Models
    • Use Poisson regression to fit a proportional hazards model to each node
  • Examine residual patterns for each treatment level

  • Why group ordinal variables?
    • Grouping values of ordinal X variables may result in power loss
    • But grouping allows missing values to be used!
  • Test for treatment interaction

  • Perturb population instead of data

Key Ideas: Use Piecewise-Linear Models

Key Ideas: Examine Residual Patterns

Key Ideas: Perturb Population

GUIDE Subgroup Identification Algorithm

Case Studies using R

Implementation

See the partykit package to build a tree-structured model

tree::tree
rpart::rpart
mvpart::mvpart
party::mob
psychotree::psychotree
betareg::betatree
RWeka::M5
evtree::evtree
REEMtree::REEMtree
vcrpart::vcrpart
kaps::kaps
melt::melt

Random Forest Guided Tour

Motivation

  • Tree-structured Modeling
    • Weak learner
    • Large variance
  • Key Idea
    • \(Var(X) = \sigma^2 \rightarrow Var(\frac{1}{n}\sum_{i=1}^{n}X_i) = \frac{\sigma^2}{n}\)
    • with resampling techniques
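
The variance identity above is easy to verify by simulation. Note it holds exactly only for independent terms; trees in a forest are correlated, so the reduction in practice is partial:

```python
# Monte Carlo check: for i.i.d. X_i with variance sigma^2,
# Var(mean of n draws) = sigma^2 / n.
import numpy as np

rng = np.random.default_rng(0)
sigma, n, reps = 2.0, 25, 200_000

# Each row is one "ensemble" of n i.i.d. draws; average within rows.
means = rng.normal(0.0, sigma, size=(reps, n)).mean(axis=1)
print(means.var(), sigma ** 2 / n)  # both close to 0.16
```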

History

History of Random Forest (RF)

Breiman's RF Algorithm (2001)

Figure 1 of Boulesteix et al. (2012)

Parameters for RF algorithm

There are three parameters for RF algorithms:

  • \(a_n \in \{1, \ldots, n\}\): the number of sampled data points in each tree
  • \(mtry \in \{1, \ldots, p\}\): the number of possible directions for splitting at each node of each tree
  • \(nodesize \in \{1, \ldots, a_n \}\): the number of examples in each cell below which the cell is not split

By default, in the regression mode of the randomForest package, the parameters are
mtry = ceiling(p / 3), a_n = n, and nodesize = 5.
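
In scikit-learn terms, the three parameters map roughly as follows (an approximate mapping for illustration; names and exact defaults differ from the randomForest package):

```python
# The three RF parameters expressed as scikit-learn arguments.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=9, n_informative=5,
                       noise=10.0, random_state=0)
p = X.shape[1]
rf = RandomForestRegressor(
    n_estimators=100,
    max_samples=None,             # a_n = n: full-size bootstrap samples
    max_features=max(1, p // 3),  # mtry, roughly p/3 in regression
    min_samples_leaf=5,           # plays the role of nodesize
    random_state=0,
).fit(X, y)
print(rf.score(X, y))             # in-sample R^2
```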

Breiman's RF Algorithm (2001)

Algorithm 1 of Biau and Scornet (2016)

Variable Importance

RF can be used to rank the importance of variables via two measures of significance:

  • Gini importance, a.k.a. Mean Decrease Impurity (MDI) (Breiman, 2003a)
    • the total decrease in node impurity from splitting on the variable, averaged over all trees
    • an important variable yields a large decrease in Gini impurity when selected, leading to a high Gini VIM.
  • Permutation importance, a.k.a. Mean Decrease Accuracy (MDA) (Breiman, 2001)
    • if the variable is not important, then rearranging its values should not degrade prediction accuracy.
    • the difference between the OOB error on a dataset obtained by randomly permuting the predictor of interest and the OOB error on the original dataset.
    • permuting an important variable increases the OOB error, leading to a high permutation VIM.

For more detail, see Goldstein et al. (2014).
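
Both measures can be computed with scikit-learn (a sketch; here permutation importance is scored on a held-out set rather than on OOB cases):

```python
# Gini/MDI importance comes free from fitting; permutation importance
# shuffles one column at a time and measures the accuracy drop.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(Xtr, ytr)

mdi = rf.feature_importances_           # Mean Decrease Impurity
perm = permutation_importance(rf, Xte, yte, n_repeats=10, random_state=0)
print(mdi)
print(perm.importances_mean)            # Mean Decrease Accuracy analogue
```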

VIM: MDI

Set \({\bf X} = (X^{(1)}, \ldots, X^{(p)})\).
For a forest resulting from the aggregation of \(M\) trees, the MDI of the variable \(X^{(j)}\) is defined by
\[ \widehat{MDI}(X^{(j)}) = \frac{1}{M} \sum_{l = 1}^{M} \sum_{t \in \mathcal{T}_l} p_{n,t} L_{reg, n} (j_{n,t}, z_{n,t}), \] where

VIM: MDA

Conditional Forest

GUIDE Forest

For more details, see [the JSCS paper].

Extensions of Random Forests

Survival Forests

Online Forests

Online learning is a method in which data become available in sequential order and are used to update the best predictor for future data at each step, as opposed to batch learning techniques, which generate the best predictor by learning on the entire training dataset at once.

  • Yi et al. (2012) propose Information Forests, whose construction defers classification until a measure of classification confidence is sufficiently high, and breaks down the data so as to maximize this measure.
  • Lakshminarayanan et al. (2014) propose Mondrian Forests, which are grown in an online fashion and achieve predictive performance comparable with other online random forests while being faster.

Missing Data

One of the strengths of RF is that they can handle missing data.

Hapfelmeier's Algorithm

  1. Compute the OOB prediction error of a tree.
  2. Randomly assign each observation, with probability \(p_k\), to the child nodes of a node \(k\) that uses \(X_i\) for its primary split.
  3. Recompute the OOB prediction error of the tree, following step 2.
  4. Compute the difference between the original and recomputed OOB prediction errors.
  5. Repeat steps 1-4 for each tree and use the average difference of OOB prediction errors over all trees as the overall estimate.
  • Breiman (2003) utilizes the proximity matrix, which measures the proximity between pairs of observations in the forest, to estimate missing values.
  • Rieger et al. (2010) introduce data imputation based on random forests. In sum, there is no clear advantage for either \(k\)NN-imputation or surrogate variables in dealing with missing values in the predictor variables.
  • Hapfelmeier et al. (2014a; 2014b) propose new variable importance measures for when a covariate contains missing values.

Unbalanced Data

RF can naturally be adapted to the unbalanced-data framework by down-sampling the majority class and growing each tree on a more balanced dataset (Chen et al. 2004; Kuhn and Johnson, 2013).

  • Recall that RF is a tree ensemble method. A large number of bootstrap samples are taken from the training data and a separate unpruned tree is created for each dataset.
  • This means that if there are \(n\) training set instances, the resulting sample will select \(n\) samples with replacement.
  • To incorporate down-sampling, random forest can take a random sample of size \(c \times n_{min}\), where \(c\) is the number of classes and \(n_{min}\) is the number of samples in the minority class.
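
A minimal sketch of the balanced bootstrap described above (the helper balanced_bootstrap is hypothetical, written here for illustration):

```python
# Build one bootstrap sample of c * n_min cases, balanced across
# classes, and grow a tree on it.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def balanced_bootstrap(y, rng):
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    idx = [rng.choice(np.where(y == c)[0], size=n_min, replace=True)
           for c in classes]
    return np.concatenate(idx)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.1).astype(int)   # ~10% minority class

idx = balanced_bootstrap(y, rng)
tree = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
print(len(idx), np.bincount(y[idx]))       # c * n_min cases, balanced
```

Repeating this for each tree of the ensemble gives the down-sampled RF.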

Fink et al. (2010) use RF in which each tree is trained on, and allowed to predict for, a particular region in space and time.

Single Class Data (One Class Classification)

One-class classification (unary classification) tries to identify objects of a specific class amongst all objects, by learning from a training set containing only the objects of that class.

  • Désir et al. (2013) study One-Class Random Forests, which are designed to solve this particular problem.
  • Geremia et al. (2013) propose Spatially Adaptive Random Forests to deal with semantic image segmentation applied to medical imaging protocols.

Unsupervised Learning

  • Define an RF dissimilarity measure between unlabeled data.
    • The idea is to construct an RF predictor that distinguishes the “observed” data from suitably generated synthetic data.
    • A synthetic class outcome is defined by labeling the observed data by class 1 and the synthetic data by class 2. By restricting the resulting labeled similarity measure to the observed data, one can define a similarity measure between unlabeled observations. The similarity measure strongly depends on how the synthetic observations are generated.
    • See Shi and Horvath (2007; JCGS).
  • Yan et al. (2013) present a new clustering method called Cluster Forests, which randomly probes a high-dimensional data cloud to obtain good local clusterings and then aggregates them via spectral clustering to obtain cluster assignments for the whole dataset.
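
The synthetic-class construction can be sketched as follows. Permuting each column independently is one common way to generate the synthetic class (it destroys dependence while keeping the marginals); Shi and Horvath discuss alternatives:

```python
# Label observed data as class 1, column-permuted data as class 2,
# and train an RF to separate them; leaf co-membership then yields
# a dissimilarity between unlabeled observations.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
obs = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.9], [0.9, 1.0]],
                              size=300)

synth = obs.copy()
for j in range(synth.shape[1]):            # permute columns independently
    synth[:, j] = rng.permutation(synth[:, j])

X = np.vstack([obs, synth])
y = np.r_[np.ones(300), np.zeros(300)]
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(rf.score(X, y))  # the forest separates real from synthetic data
```

The forest succeeds here because the observed class has correlated columns and the synthetic class does not; that learned structure is what the dissimilarity measure exploits.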

Case Studies using R packages

Implementations in R

Tree-structured Models    Random Forests
tree::tree
rpart::rpart              randomForest::rf
mvpart::mvpart            randomForestSRC
party::mob                Rborist
psychotree::psychotree    ranger::ranger
betareg::betatree         party::cforest
RWeka::M5                 quantregForest
evtree::evtree            LogicForest
REEMtree::REEMtree
vcrpart::vcrpart
melt::melt

For more details, see the CRAN Task View.

DATA: CAR dataset

rpart

mxnet

Trees with Relationship to Other Methods

\(K\)-Nearest Neighbors Algorithm

  • Defer the decision to generalize beyond the training examples until a new query is encountered.
  • Whenever we have a new point to classify, we find its \(k\) nearest neighbors from the training data.

  • Algorithm
    • For each training observation, add the data to the list of training observations.
    • Given a query instance \(x_q\) to be classified, let \(x_1, \ldots, x_k\) denote the \(k\) instances from the training observations that are nearest to \(x_q\).
    • Return the class that is most frequent among the \(k\) instances.
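
The three algorithm steps above as a self-contained sketch:

```python
# Store the training data, find the k nearest neighbors of a query,
# and return the majority class among them.
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    dist = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dist)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[counts.argmax()]

X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X, y, np.array([0.15])))  # → 0
print(knn_predict(X, y, np.array([4.9])))   # → 1
```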

Trees as Nearest Neighbors

  • RF can be viewed as a weighted neighborhood scheme:
    • \(\hat{y} = \sum_{i=1}^{n} W(x_i, x') y_i\),
    • where \(W(x_i, x')\) is the non-negative weight of the \(i\)th training point relative to the new point \(x'\).
    • In \(k\)-NN, the weight \(W(x_i, x') = \frac{1}{k}\) if \(x_i\) is one of the \(k\) points closest to \(x'\), and zero otherwise.
    • In RF, \(W(x_i, x') = \frac{1}{k'}\) if \(x_i\) is one of the \(k'\) points in the same leaf as \(x'\), and zero otherwise.
    • Since a forest averages the predictions of a set of \(m\) trees with individual weight functions \(W_j\), its prediction is \(\hat{y} = \sum_{i=1}^{n} \left( \frac{1}{m} \sum_{j=1}^{m} W_j(x_i, x') \right) y_i\).

This shows that the whole forest is again a weighted neighborhood scheme, with weights that average those of the individual trees. The neighbors of x' in this interpretation are the points \(x_{i}\) which fall in the same leaf as x' in at least one tree of the forest. In this way, the neighborhood of x' depends in a complex way on the structure of the trees, and thus on the structure of the training set.

For more detail, see Lin and Jeon (2006).
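
The weights \(W_j(x_i, x')\) can be extracted explicitly from a fitted forest. The sketch below grows each tree on the full sample (no bootstrap), so the weighted average of training responses recovers the forest prediction exactly:

```python
# Recover an RF prediction as a weighted neighborhood average:
# weight 1/k' for each training point sharing a leaf with x',
# averaged over trees.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=100, n_features=3, n_informative=3,
                       noise=5.0, random_state=0)
rf = RandomForestRegressor(n_estimators=10, bootstrap=False,
                           max_features=1, min_samples_leaf=5,
                           random_state=0).fit(X, y)

x_new = X[:1]
leaves_train = rf.apply(X)      # (n, m) leaf id of each case in each tree
leaves_new = rf.apply(x_new)    # (1, m) leaf ids of the new point

W = np.zeros(len(X))
for j in range(rf.n_estimators):
    same = leaves_train[:, j] == leaves_new[0, j]
    W += same / same.sum()      # weight 1/k' inside the shared leaf
W /= rf.n_estimators

print(W @ y, rf.predict(x_new)[0])  # the two numbers coincide
```

With bootstrap resampling the correspondence still holds tree by tree, but the weights must then be computed on each tree's bootstrap sample.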

Trees as Neural Networks

Trees as Neural Networks: First Hidden Layer

Trees as Neural Networks: Second Hidden Layer

Trees as Neural Networks: Output Layer

RF as Neural Networks

Neural random forests

Deep neural decision forests

  • Stochastic & differentiable RF
  • Unifies representation learning and classifier
  • Moves away from greedy, local construction of traditional RFs
  • Learning of split node parameters using SGD & backprop
  • Given state of tree/network, get leaf predictions

See the talk slides.